Redoing Weka Stuff

In this section we redo some of the things we have already done in Weka. Objective: to try out some familiar classification and regression algorithms in Python using its libraries.

Imports

I always try to import all the useful libraries upfront. This is also considered good practice in the programming community.


In [1]:
%matplotlib inline
import numpy as np
from scipy.io import arff
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import patsy
import statsmodels.api as sm

from sklearn import tree, linear_model, metrics, dummy, naive_bayes, neighbors

from IPython.display import Image
import pydotplus

In [2]:
sns.set_context("paper")
sns.set_style("ticks")

def load_arff(filename):
    data, meta = arff.loadarff(filename)
    df = pd.DataFrame(data, columns=meta.names())
    for c, k in zip(df.columns, meta.types()):
        if k == "nominal":
            df[c] = df[c].astype("category")
        if k == "numeric":
            df[c] = df[c].astype("float")        
    return df

def get_confusion_matrix(clf, X, y, verbose=True):
    y_pred = clf.predict(X)
    cm = metrics.confusion_matrix(y_true=y, y_pred=y_pred)
    clf_report = metrics.classification_report(y, y_pred)
    df_cm = pd.DataFrame(cm, columns=clf.classes_, index=clf.classes_)
    if verbose:
        print(clf_report)
        print(df_cm)
    return clf_report, df_cm

def show_decision_tree(clf, X, y):
    # Renders the fitted tree inline; needs graphviz and pydotplus.
    # class_names must follow clf.classes_ so labels line up with the splits.
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=X.columns,
                                    class_names=clf.classes_,
                                    filled=True, rounded=True,
                                    special_characters=True, impurity=False)
    graph = pydotplus.graph_from_dot_data(dot_data)
    return Image(graph.create_png())


def plot_decision_regions(clf, X, y, col_x=0, col_y=1,
                          ax=None, plot_step=0.01, colors="bry"):
    if ax is None:
        fig, ax = plt.subplots()
    x_min, x_max = X[col_x].min(), X[col_x].max()
    y_min, y_max = X[col_y].min(), X[col_y].max()
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    # Predict over the grid, then map class labels to integer codes
    # so contourf can color the regions.
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    _, Z = np.unique(Z, return_inverse=True)
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    for i, l in enumerate(clf.classes_):
        idx = np.where(y == l)[0]
        # .iloc replaces the deprecated .ix for positional row selection.
        ax.scatter(X[col_x].iloc[idx], X[col_y].iloc[idx], label=l, c=colors[i])
    ax.set_xlabel(col_x)
    ax.set_ylabel(col_y)
    ax.legend(bbox_to_anchor=(1.2, 0.5))
    # Use the axes' own figure so this also works when an ax is passed in.
    ax.figure.tight_layout()
    return ax
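
One caveat if load_arff is reused under Python 3: there, scipy's arff.loadarff returns nominal values as byte strings, so a decode step is needed before the category cast. A minimal sketch, assuming UTF-8 encoded nominals:

def load_arff_py3(filename):
    # Same as load_arff above, but decodes the byte strings that
    # scipy.io.arff.loadarff produces for nominal attributes on Python 3.
    data, meta = arff.loadarff(filename)
    df = pd.DataFrame(data, columns=meta.names())
    for c, k in zip(df.columns, meta.types()):
        if k == "nominal":
            df[c] = df[c].str.decode("utf-8").astype("category")
        if k == "numeric":
            df[c] = df[c].astype("float")
    return df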

In [3]:
df = load_arff("../data/iris.arff")
print(df.shape)
df.head()


(150, 5)
Out[3]:
   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [4]:
df.dtypes


Out[4]:
sepallength     float64
sepalwidth      float64
petallength     float64
petalwidth      float64
class          category
dtype: object
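
scikit-learn accepts these string class labels directly, but if a downstream step ever needs numeric labels, the category dtype already carries integer codes. A small sketch (nothing here is needed for the rest of this notebook):

codes = df["class"].cat.codes                          # 0, 1, 2 per row
mapping = dict(enumerate(df["class"].cat.categories))  # code -> label
print(mapping)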

Feature creation - math expressions


In [5]:
df_t = df.copy()  # since we are going to edit the data, we should always make a copy

In [6]:
df_t.head()


Out[6]:
   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

In [7]:
df_t["sepallength_sqr"] = df_t["sepallength"]**2 ## ** in python is used for exponent.
df_t.head()


Out[7]:
   sepallength  sepalwidth  petallength  petalwidth        class  sepallength_sqr
0          5.1         3.5          1.4         0.2  Iris-setosa            26.01
1          4.9         3.0          1.4         0.2  Iris-setosa            24.01
2          4.7         3.2          1.3         0.2  Iris-setosa            22.09
3          4.6         3.1          1.5         0.2  Iris-setosa            21.16
4          5.0         3.6          1.4         0.2  Iris-setosa            25.00

In [8]:
df_t["sepallength_log"] = np.log10(df_t["sepallength"])
df_t.head()


Out[8]:
   sepallength  sepalwidth  petallength  petalwidth        class  sepallength_sqr  sepallength_log
0          5.1         3.5          1.4         0.2  Iris-setosa            26.01         0.707570
1          4.9         3.0          1.4         0.2  Iris-setosa            24.01         0.690196
2          4.7         3.2          1.3         0.2  Iris-setosa            22.09         0.672098
3          4.6         3.1          1.5         0.2  Iris-setosa            21.16         0.662758
4          5.0         3.6          1.4         0.2  Iris-setosa            25.00         0.698970

Creating many features at once using patsy


In [9]:
df_t = df_t.rename(columns={"class": "label"})
df_t.head()


Out[9]:
   sepallength  sepalwidth  petallength  petalwidth        label  sepallength_sqr  sepallength_log
0          5.1         3.5          1.4         0.2  Iris-setosa            26.01         0.707570
1          4.9         3.0          1.4         0.2  Iris-setosa            24.01         0.690196
2          4.7         3.2          1.3         0.2  Iris-setosa            22.09         0.672098
3          4.6         3.1          1.5         0.2  Iris-setosa            21.16         0.662758
4          5.0         3.6          1.4         0.2  Iris-setosa            25.00         0.698970

In [13]:
y, X = patsy.dmatrices("label ~ petalwidth + petallength:petalwidth + I(sepallength**2)-1", data=df_t, return_type="dataframe")
print(y.shape, X.shape)


(150, 3) (150, 3)

In [14]:
y.head()


Out[14]:
   label[Iris-setosa]  label[Iris-versicolor]  label[Iris-virginica]
0                 1.0                     0.0                    0.0
1                 1.0                     0.0                    0.0
2                 1.0                     0.0                    0.0
3                 1.0                     0.0                    0.0
4                 1.0                     0.0                    0.0

In [15]:
X.head()


Out[15]:
   petalwidth  petallength:petalwidth  I(sepallength ** 2)
0         0.2                    0.28                26.01
1         0.2                    0.28                24.01
2         0.2                    0.26                22.09
3         0.2                    0.30                21.16
4         0.2                    0.28                25.00
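
The formula in In [13] packs three engineered features into one line: a main effect (petalwidth), an interaction petallength:petalwidth (the elementwise product), and I(sepallength**2), where I(...) shields the arithmetic from the formula language so ** is evaluated by Python; the trailing -1 drops the intercept column. A toy sketch on a made-up frame (the frame and names below are hypothetical, purely to show which columns get generated):

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
# Expect columns: a, a:b, I(a ** 2) -- and no intercept because of the -1.
X_toy = patsy.dmatrix("a + a:b + I(a**2) - 1", data=toy, return_type="dataframe")
print(X_toy)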

In [16]:
model = sm.MNLogit(y, X)
res = model.fit()
res.summary()


Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.053951
         Iterations: 35
/content/smishra8/SOFTWARE/anaconda2/envs/datamining/lib/python2.7/site-packages/statsmodels/base/model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Out[16]:
                          MNLogit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  150
Model:                        MNLogit   Df Residuals:                      144
Method:                           MLE   Df Model:                            4
Date:                Thu, 13 Oct 2016   Pseudo R-squ.:                  0.9509
Time:                        19:13:27   Log-Likelihood:                -8.0926
converged:                      False   LL-Null:                       -164.79
                                        LLR p-value:                 1.394e-66
============================================================================================
y=label[Iris-versicolor]       coef    std err          z      P>|z|  [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------
petalwidth                   5.1590   1.69e+05   3.06e-05      1.000   -3.3e+05    3.3e+05
petallength:petalwidth      17.6445   1.78e+04      0.001      0.999  -3.49e+04    3.5e+04
I(sepallength ** 2)         -1.5079   3363.455     -0.000      1.000  -6593.758   6590.742
--------------------------------------------------------------------------------------------
y=label[Iris-virginica]        coef    std err          z      P>|z|  [95.0% Conf. Int.]
--------------------------------------------------------------------------------------------
petalwidth                 -18.1969   1.69e+05     -0.000      1.000   -3.3e+05    3.3e+05
petallength:petalwidth      24.2166   1.78e+04      0.001      0.999  -3.49e+04    3.5e+04
I(sepallength ** 2)         -1.8904   3363.455     -0.001      1.000  -6594.141   6590.360
============================================================================================
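
The fit did not converge and the standard errors are astronomical. This is the classic symptom of (quasi-)complete separation: these features split the classes almost perfectly, so the unpenalized likelihood keeps improving along some coefficient direction. One hedge is statsmodels' regularized fit; a minimal sketch, where alpha=1.0 is an arbitrary, untuned penalty:

# L1-regularized multinomial fit (sketch; penalty strength not tuned).
res_reg = sm.MNLogit(y, X).fit_regularized(method="l1", alpha=1.0)
print(res_reg.params)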

In [18]:
model_sk = linear_model.LogisticRegression(multi_class="multinomial", solver="lbfgs")
model_sk.fit(X, df_t["label"])


Out[18]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [19]:
y_pred = model_sk.predict(X)

In [20]:
y_pred[:10]


Out[20]:
array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa'], dtype=object)

In [21]:
print(metrics.classification_report(df_t["label"], y_pred))


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       0.94      0.94      0.94        50
 Iris-virginica       0.94      0.94      0.94        50

    avg / total       0.96      0.96      0.96       150


In [22]:
model_sk_t = tree.DecisionTreeClassifier()

In [24]:
model_sk_t.fit(X, df_t["label"])


Out[24]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
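
With no depth or leaf-size constraints, the tree will keep splitting until it fits the training data perfectly, which is exactly what the 1.00 training scores further down show. A sketch of the usual guard, where max_depth=3 is an arbitrary cap rather than a tuned value:

model_sk_t_capped = tree.DecisionTreeClassifier(max_depth=3)
model_sk_t_capped.fit(X, df_t["label"])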

In [25]:
show_decision_tree(model_sk_t, X, df_t["label"])


Out[25]:
[inline image: decision tree rendered by show_decision_tree]
In [26]:
model_0r = dummy.DummyClassifier(strategy="most_frequent")
model_0r.fit(X, df_t["label"])
y_pred = model_0r.predict(X)
print(metrics.classification_report(df_t["label"], y_pred))


                 precision    recall  f1-score   support

    Iris-setosa       0.33      1.00      0.50        50
Iris-versicolor       0.00      0.00      0.00        50
 Iris-virginica       0.00      0.00      0.00        50

    avg / total       0.11      0.33      0.17       150

/content/smishra8/SOFTWARE/anaconda2/envs/datamining/lib/python2.7/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

In [27]:
cm = metrics.confusion_matrix(y_true=df_t["label"], y_pred=y_pred)

In [28]:
df_cm = pd.DataFrame(cm, columns=model_0r.classes_, index=model_0r.classes_)

In [29]:
df_cm


Out[29]:
                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor           50                0               0
Iris-virginica            50                0               0

In [30]:
_ = get_confusion_matrix(model_0r, X, df_t["label"])


                 precision    recall  f1-score   support

    Iris-setosa       0.33      1.00      0.50        50
Iris-versicolor       0.00      0.00      0.00        50
 Iris-virginica       0.00      0.00      0.00        50

    avg / total       0.11      0.33      0.17       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor           50                0               0
Iris-virginica            50                0               0

In [31]:
_ = get_confusion_matrix(model_sk_t, X, df_t["label"])


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       1.00      1.00      1.00        50
 Iris-virginica       1.00      1.00      1.00        50

    avg / total       1.00      1.00      1.00       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               50               0
Iris-virginica             0                0              50

In [32]:
_ = get_confusion_matrix(model_sk, X, df_t["label"])


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       0.94      0.94      0.94        50
 Iris-virginica       0.94      0.94      0.94        50

    avg / total       0.96      0.96      0.96       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               47               3
Iris-virginica             0                3              47
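
Note that all of the scores above are computed on the training data, which flatters every model (the unconstrained tree most of all). A held-out evaluation would look like the sketch below; train_test_split lives in sklearn.model_selection on current versions (older releases contemporary with this notebook expose it as sklearn.cross_validation):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, df_t["label"], test_size=0.3, random_state=42, stratify=df_t["label"])
clf_ho = tree.DecisionTreeClassifier().fit(X_train, y_train)
print(metrics.classification_report(y_test, clf_ho.predict(X_test)))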

Plot decision regions

We can only do this when the model is trained on exactly two features.


In [33]:
y, X = patsy.dmatrices("label ~ petalwidth + petallength - 1", data=df_t, return_type="dataframe") 
# the -1 tells patsy not to generate an intercept column

In [34]:
X.columns


Out[34]:
Index([u'petalwidth', u'petallength'], dtype='object')

In [35]:
y = df_t["label"]

In [36]:
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
_ = get_confusion_matrix(clf, X, y)


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       1.00      0.98      0.99        50
 Iris-virginica       0.98      1.00      0.99        50

    avg / total       0.99      0.99      0.99       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               49               1
Iris-virginica             0                0              50

In [37]:
clf.feature_importances_


Out[37]:
array([ 0.93507842,  0.06492158])
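
The array follows the column order of X, so nearly all of the (impurity-based) importance sits on petalwidth here. Pairing the numbers with the column names makes that explicit:

print(dict(zip(X.columns, clf.feature_importances_)))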

In [38]:
show_decision_tree(clf, X, y)


Out[38]:
[inline image: decision tree rendered by show_decision_tree]
In [39]:
X.head()


Out[39]:
   petalwidth  petallength
0         0.2          1.4
1         0.2          1.4
2         0.2          1.3
3         0.2          1.5
4         0.2          1.4

In [40]:
y.value_counts()


Out[40]:
Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: label, dtype: int64

In [42]:
plot_decision_regions(clf, X, y, col_x="petalwidth", col_y="petallength")


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fad687ad750>
[inline figure: decision regions of the decision tree over petalwidth/petallength]

Naive Bayes classifier


In [43]:
clf = naive_bayes.GaussianNB()
clf.fit(X, y)
_ = get_confusion_matrix(clf, X, y)


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       0.94      0.94      0.94        50
 Iris-virginica       0.94      0.94      0.94        50

    avg / total       0.96      0.96      0.96       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               47               3
Iris-virginica             0                3              47

The decision surface of the Naive Bayes classifier will not show overlapping colors because of the basic code I am using to draw decision boundaries; better plotting code could show the mixing of colors properly.
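
The mixing is recoverable from the model itself: GaussianNB exposes predict_proba, so a probability surface can be computed over the same grid. A sketch (not wired into plot_decision_regions) that shades by the winning class's probability:

gx, gy = np.meshgrid(np.linspace(X["petalwidth"].min(), X["petalwidth"].max(), 200),
                     np.linspace(X["petallength"].min(), X["petallength"].max(), 200))
proba = clf.predict_proba(np.c_[gx.ravel(), gy.ravel()])
confidence = proba.max(axis=1).reshape(gx.shape)  # from ~1/3 (ties) up to ~1.0
# confidence could then be drawn with ax.contourf(gx, gy, confidence, alpha=0.5)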


In [44]:
plot_decision_regions(clf, X, y, col_x="petalwidth", col_y="petallength")


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fad66d64850>
[inline figure: decision regions of the Naive Bayes classifier]

Logistic regression


In [45]:
clf = linear_model.LogisticRegression(multi_class="multinomial", solver="lbfgs")
clf.fit(X, y)
_ = get_confusion_matrix(clf, X, y)


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       0.96      0.94      0.95        50
 Iris-virginica       0.94      0.96      0.95        50

    avg / total       0.97      0.97      0.97       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               47               3
Iris-virginica             0                2              48

In [46]:
plot_decision_regions(clf, X, y, col_x="petalwidth", col_y="petallength")


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fad68dfaf90>
[inline figure: decision regions of the logistic regression model]

IBk (k-nearest neighbors classifier)


In [47]:
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
_ = get_confusion_matrix(clf, X, y)


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       1.00      0.98      0.99        50
 Iris-virginica       0.98      1.00      0.99        50

    avg / total       0.99      0.99      0.99       150

                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               49               1
Iris-virginica             0                0              50
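
With n_neighbors=1 every training point is usually its own nearest neighbor, so training accuracy is near-perfect; the single error above comes from identical feature pairs that carry different labels. Larger k trades this memorization for smoother boundaries; a quick sketch (training-set accuracy only):

for k in (1, 3, 5, 15):
    acc = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X, y).score(X, y)
    print(k, acc)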

In [48]:
plot_decision_regions(clf, X, y, col_x="petalwidth", col_y="petallength")


Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fad687d0ed0>
[inline figure: decision regions of the 1-nearest-neighbor classifier]
